1  Abstract

This paper investigates adaptation in vocabulary-independent (VI)
speech recognition systems. Just as speaker adaptation tailors a
speaker-independent system to a speaker, two vocabulary adaptation
algorithms [5] are implemented to tailor the VI subword models to
the target vocabulary. The first algorithm generates
vocabulary-adapted clustering decision trees by focusing on relevant
allophones during tree generation; it reduces the VI error rate by
9%. The second algorithm, vocabulary-bias training, gives the
relevant allophones more prominence by assigning more weight to them
during Baum-Welch training of the generalized allophonic models; it
reduces the VI error rate by 15%. Finally, to overcome the
degradation caused by the different acoustic environments used for
VI training and testing, CDCN and ISDCN, originally designed for
microphone adaptation, are incorporated into our VI system; both
reduce the degradation of VI cross-environment recognition by 50%.

2  Introduction 

At the 1989 and 1991 DARPA Speech and Natural Language Workshops
[8, 7], we showed that accurate vocabulary-independent (VI) speech
recognition is possible. However, tasks (vocabularies) differ in
many structural respects, such as the size of the vocabulary and the
frequency of confusable words, and these differences can affect
which acoustic modeling techniques achieve optimal performance in
vocabulary-dependent (VD) systems. For example, whole-word models
are often used in small-vocabulary tasks, while subword models must
be used in large-vocabulary tasks. Moreover, within a limited
vocabulary it is possible to design special features that separate
the confusable models. This is why discriminative training
techniques, such as neural networks [10] and maximum mutual
information estimation (MMIE) [4], have been so successful in
small-vocabulary tasks.

Just as with speaker adaptation in speaker-independent systems, it
is desirable to implement vocabulary adaptation to tailor the VI
system to the target vocabulary (task). Our first vocabulary
adaptation algorithm builds vocabulary-adapted allophonic clustering
decision trees for the target vocabulary based only on the relevant
allophones. The adapted trees focus only on the relevant contexts
needed to separate the relevant allophones, and thus give the
resulting allophonic clusters more discriminative power for the
target vocabulary. In an experiment adapting the allophone
clustering trees to the Resource Management task, this algorithm
achieved a 9% error reduction.

Our second vocabulary adaptation algorithm focuses on the relevant
allophones during the training of the generalized allophonic models,
rather than during the generation of the allophonic clustering
decision trees. To this end, we give the relevant allophones more
prominence by assigning them more weight during Baum-Welch training
of the generalized allophonic models. With vocabulary-bias training,
we are able to reduce the VI error rate by 15% for the Resource
Management task.

We have found that different recording environments for training
and testing (CMU vs. TI) degrade performance significantly [6], even
when the same microphone is used in both cases. Based on the
framework of semi-continuous HMMs, we proposed updating the codebook
prototypes in discrete HMMs in order to fit speech vectors from new
environments [5]. Moreover, codeword-dependent cepstral
normalization (CDCN) and interpolated SNR-dependent cepstral
normalization (ISDCN), proposed by Acero et al. [2] for microphone
adaptation, are incorporated into our VI system to achieve
environmental robustness. CDCN uses the speech knowledge represented
in a codebook to estimate the noise and spectral equalization
correction vectors for environmental normalization. In ISDCN, the
SNR-dependent correction vectors are obtained via the EM algorithm
to minimize the VQ distortion. Both algorithms reduced the
degradation of VI cross-environment recognition by 50%.

In this paper, we first describe our two vocabulary adaptation
algorithms, vocabulary-adapted decision trees and vocabulary-bias
training. We then describe the codebook adaptation algorithm and
two cepstral normalization techniques, CDCN and ISDCN, for
environmental robustness. We also present results for these
vocabulary and environment adaptation algorithms. Finally, we close
with some concluding remarks about this work and future work.


[Report Documentation Page, Standard Form 298. Title: Vocabulary and
Environment Adaptation in Vocabulary-Independent ... (remainder
illegible). Report date: 1992. Performing organization: Carnegie
Mellon University, School of Computer Science, Pittsburgh, PA 15213.
Distribution: approved for public release; distribution unlimited.
Security classification: unclassified. Number of pages: 6.]

3  Vocabulary  Adaptation 

Unlike most speaker adaptation techniques, our vocabulary
adaptation algorithms rely only on an analysis of the target
vocabulary and thus do not require any additional
vocabulary-specific data. Two terms that play an essential role in
our algorithms are defined as follows.

relevant  allophones  Those  allophones  which  occur  in  the 
target  vocabulary  (task). 

irrelevant  allophones  Those  allophones  which  occur  in  the
VI  training  set,  but  not  in  the  target  vocabulary
(task).

At the 1991 DARPA Speech and Natural Language Workshop [7], we
showed that the decision-tree based generalized allophone is an
adequate VI subword model. Figure 1 shows an example of our VI
subword unit, the generalized allophone, which is actually an
allophonic cluster. The allophones in the white area are relevant
allophones and the rest are irrelevant ones.


Figure 1: A generalized allophone (allophonic cluster)

3.1  Vocabulary-Adapted  Decision  Tree

Our first vocabulary adaptation algorithm changes the allophone
clustering (the decision trees) so that the new set of subword
models has more discriminative power for the target vocabulary.
Since the clustering decision trees were built on the entire VI
training set, the enormous number of irrelevant allophones might
result in a sub-optimal clustering of allophones for the target
vocabulary.

Consider the following scenario. Figure 2 shows a split in the
original decision tree for phone /k/, generated from the
vocabulary-independent training set; the question associated with
this split is "Is the left context a vowel?". Suppose all the left
contexts of phone /k/ in the target vocabulary are vowels. The
question is then unsuitable for the target vocabulary, because the
split assigns all the allophones of /k/ in the target vocabulary to
one branch and cannot discriminate among them.

On the other hand, if only the relevant allophones are considered
for this split, the chosen question will be one of the relevant
questions, namely the one that separates the relevant allophones
most appropriately and therefore has the greatest discriminative
ability among them. Figure 3 shows such an optimal split for the
relevant allophones. Because the clustering decision trees are
generated recursively, the enormous number of irrelevant allophones
prevents tree generation from concentrating on the relevant
allophones and relevant questions, and results in sub-optimal trees
for the relevant allophones.

Figure 2: A split (question) in the original decision tree for
phone /k/. Split question: "Left = Vowel?" (yes/no branches).
Shaded regions mark irrelevant allophones; white regions mark
relevant allophones.

Figure 3: The corresponding optimal split (question) for the
relevant allophones of phone /k/. Split question: "Right = Liquid?".
Shaded regions mark irrelevant allophones; white regions mark
relevant allophones.

Based on this analysis, our first adaptation algorithm builds
vocabulary-adapted (VA) decision trees by using only relevant
allophones during the generation of the decision trees. The adapted
trees are not only generated automatically, but also focus on the
relevant questions needed to separate the relevant allophones, and
therefore give the resulting allophonic clusters more discriminative
power for the target vocabulary.
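
The effect described above can be illustrated with a small sketch. It is not the system's implementation: the allophone records, the codeword histograms, the two question names and the count-weighted entropy gain are all assumptions made for illustration. It picks the split question for /k/ once from all allophones and once from the relevant ones only.

```python
import math
from collections import Counter

VOWELS = {"a", "i"}      # toy phone classes, assumed for this example
LIQUIDS = {"l", "r"}

def weighted_entropy(allophones):
    """Count-weighted entropy (bits) of the pooled codeword histogram."""
    pooled = Counter()
    for a in allophones:
        pooled.update(a["hist"])
    total = sum(pooled.values())
    return -sum(c * math.log2(c / total) for c in pooled.values() if c > 0)

def best_question(allophones, questions):
    """Return (name, gain) of the question with the largest entropy reduction."""
    parent = weighted_entropy(allophones)
    best = (None, 0.0)
    for name, test in questions.items():
        yes = [a for a in allophones if test(a)]
        no = [a for a in allophones if not test(a)]
        if not yes or not no:
            continue  # this question does not separate the set at all
        gain = parent - weighted_entropy(yes) - weighted_entropy(no)
        if gain > best[1]:
            best = (name, gain)
    return best

QUESTIONS = {
    "left_is_vowel": lambda a: a["left"] in VOWELS,
    "right_is_liquid": lambda a: a["right"] in LIQUIDS,
}

# Hypothetical allophones of /k/: contexts, VQ codeword counts, relevance flag.
all_k = [
    {"left": "a", "right": "l", "hist": {0: 8, 1: 2}, "relevant": True},
    {"left": "i", "right": "r", "hist": {0: 9, 1: 1}, "relevant": True},
    {"left": "a", "right": "t", "hist": {0: 1, 1: 9}, "relevant": True},
    {"left": "i", "right": "s", "hist": {0: 2, 1: 8}, "relevant": True},
    {"left": "s", "right": "t", "hist": {2: 10}, "relevant": False},
    {"left": "n", "right": "s", "hist": {2: 9, 3: 1}, "relevant": False},
    {"left": "t", "right": "l", "hist": {2: 8, 3: 2}, "relevant": False},
]
relevant_k = [a for a in all_k if a["relevant"]]

vi_choice = best_question(all_k, QUESTIONS)       # split chosen on all data
va_choice = best_question(relevant_k, QUESTIONS)  # split chosen on relevant data
```

With this toy data the left-context question wins when every allophone is considered, although it cannot separate the relevant allophones at all (their left contexts are all vowels). Restricting the candidates to relevant allophones selects the right-context question instead.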

Three potential problems arise when one examines the algorithm
closely. First, some relevant allophones might not occur in the VI
training set, since we cannot expect 100% allophone coverage for
every task, especially for large-vocabulary tasks. Nevertheless,
models for all relevant allophones must be available before the VA
decision trees are generated, because the entropy of the models is
needed to evaluate each split. For relevant allophones that also
occur in the VI training set this is trivial: the corresponding
allophonic models trained from the training data can be used
directly. For the others, the nature of decision trees helps: every
allophone can find its closest generalized allophonic cluster by
traversing the existing (VI) decision trees. The corresponding
generalized allophonic models can therefore serve as the models for
the relevant allophones that do not occur in the VI training set
during the generation of the VA clustering trees.
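
A minimal sketch of this lookup follows, assuming a tree stored as nested dictionaries whose internal nodes hold a yes/no question and whose leaves name a generalized allophone. The node layout, the cluster names and the phone classes are hypothetical.

```python
def find_cluster(tree, allophone):
    """Walk the tree from the root, answering each question, until a leaf."""
    node = tree
    while "question" in node:
        node = node["yes"] if node["question"](allophone) else node["no"]
    return node["cluster"]

# Hypothetical VI tree for phone /k/.
vi_tree_k = {
    "question": lambda a: a["left"] in {"a", "i", "u"},    # Left = Vowel?
    "yes": {
        "question": lambda a: a["right"] in {"l", "r"},    # Right = Liquid?
        "yes": {"cluster": "k_1"},
        "no": {"cluster": "k_2"},
    },
    "no": {"cluster": "k_3"},
}

# An allophone of /k/ absent from the training set still reaches a leaf,
# so the model of that generalized allophone can stand in for it.
unseen = {"left": "u", "right": "l"}
cluster = find_cluster(vi_tree_k, unseen)
```

Because each question only inspects the context, the lookup needs no training examples of the allophone itself.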

Secondly, if only the part of the VI training set that contains the
relevant allophones were used to train the new generalized
allophonic models, the new adapted models would be under-trained
and less robust. Fortunately, the nature of decision trees lets us
retain the entire training set. All allophones can find their
generalized allophonic clusters by traversing the new VA decision
trees, so the entire VI training set can contribute to the training
of the new adapted generalized allophonic models and make them
well-trained and robust.

Thirdly, the entropy criterion for splitting during the generation
of decision trees is weighted by the counts (frequencies) of
allophones [6]. By preferring to split nodes with large counts
(allophones appearing frequently), the counts of the allophonic
clusters become more balanced and the final generalized allophonic
models are equally trainable. However, the VA decision trees are
generated from the set of relevant allophones, which is not the
same as the set of allophones used to train the generalized
allophonic models, so this balance no longer holds. Some
generalized allophonic models might have only a few examples (or
even none) in the VI training set and thus cannot be well trained.
Fortunately, we can enhance the trainability of the VA subword
models through gross validation with the entire VI training set.
Gross validation for VA decision trees differs somewhat from
conventional cross validation, which uses one part of the data to
grow the trees and another, independent part to prune them in order
to predict new contexts. Since the relevant allophones are already
only a small portion of the entire VI training set, dividing them
further would prevent the learning algorithm from generating
reliable VA decision trees. Instead, we grow the VA decision trees
very deep; replace the entropy reduction of each split with the
value obtained by traversing the trees with all the allophones
(including irrelevant ones); and finally prune the trees based on
the new entropy information. This prunes the splits of nodes
without enough training support (too few examples), even though
they might be relevant to the target vocabulary. The resulting
generalized allophonic models therefore become more trainable.
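
The grow, re-score and prune procedure can be sketched as below. The text does not give the exact pruning rule, so the two thresholds (`min_gain`, `min_count`), the nested-dictionary tree layout and the toy histograms are assumptions made for illustration only.

```python
import math
from collections import Counter

def weighted_entropy(allophones):
    """Count-weighted entropy (bits) of the pooled codeword histogram."""
    pooled = Counter()
    for a in allophones:
        pooled.update(a["hist"])
    total = sum(pooled.values())
    return -sum(c * math.log2(c / total) for c in pooled.values() if c > 0)

def support(allophones):
    """Total number of training examples behind a set of allophones."""
    return sum(sum(a["hist"].values()) for a in allophones)

def is_leaf(node):
    return "question" not in node

def gross_validate(node, allophones, min_gain, min_count):
    """Re-score every split with ALL allophones and collapse weak splits
    bottom-up. The tree itself was grown on relevant allophones only."""
    if is_leaf(node):
        return node
    yes = [a for a in allophones if node["question"](a)]
    no = [a for a in allophones if not node["question"](a)]
    node["yes"] = gross_validate(node["yes"], yes, min_gain, min_count)
    node["no"] = gross_validate(node["no"], no, min_gain, min_count)
    gain = weighted_entropy(allophones) - weighted_entropy(yes) - weighted_entropy(no)
    weak = gain < min_gain or min(support(yes), support(no)) < min_count
    if weak and is_leaf(node["yes"]) and is_leaf(node["no"]):
        return {"leaf": True}  # merge the two children into one cluster
    return node

# Hypothetical deep VA tree and the full VI allophone set for one phone.
va_tree = {
    "question": lambda a: a["right"] in {"l", "r"},
    "yes": {
        "question": lambda a: a["left"] == "a",
        "yes": {"leaf": True},
        "no": {"leaf": True},
    },
    "no": {"leaf": True},
}
vi_allophones = [
    {"left": "a", "right": "l", "hist": {0: 2}},          # only 2 examples
    {"left": "i", "right": "r", "hist": {0: 40, 1: 10}},
    {"left": "s", "right": "l", "hist": {0: 35, 1: 15}},
    {"left": "a", "right": "t", "hist": {1: 50}},
    {"left": "n", "right": "s", "hist": {0: 5, 1: 45}},
]
pruned = gross_validate(va_tree, vi_allophones, min_gain=1.0, min_count=10)
```

In this example the inner split is collapsed because one of its branches is backed by only two training examples, while the root split survives because both branches are well supported by the full VI data.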

The vocabulary-adapted decision tree learning algorithm, which
emphasizes the relevant allophones while growing the decision trees
and uses gross validation with the entire VI training set, provides
an ideal means of balancing adaptability to the target vocabulary
against trainability with the VI training database.


3.2  Vocabulary-Bias  Training 

While the above adaptation algorithm tailors the subword units to
the target vocabulary by focusing on the relevant allophones during
the generation of the clustering decision trees, it treats relevant
and irrelevant allophones equally in the final training of the
generalized allophonic models. Our next adaptation algorithm gives
the relevant allophones more prominence during the training of the
generalized allophonic models.

Since  the  VI  training  database  is  supposed  to  be  very  large, 
it  is  reasonable  to  assume  that  the  irrelevant  allophones  are 
the  majority  of  almost  every  cluster.  Thus,  the  resulting  allo¬ 
phonic  cluster  will  more  likely  represent  the  acoustic  behavior 
of  the  set  of  irrelevant  allophones,  instead  of  the  set  of  relevant 
allophones. 

In order to make the relevant allophones the majority of an
allophonic cluster without incorporating new vocabulary-specific
data, we must impose a bias toward the relevant allophones during
training. Since our VI system is based on the HMM approach, it is
straightforward to give the relevant allophones more prominence by
assigning more weight to them during Baum-Welch training. The
simplest way is to multiply the contributions of relevant
allophones to the parameter re-estimation equations by a prominent
weight.

The prominent weight can be a pre-defined constant, such as 2.0 or
3.0, or a function of some variables. However, it is better for the
prominent weight to reflect the reliability of the relevant
allophones toward which we impose the bias. If a relevant allophone
occurs rarely in the training set, we should not assign a large
weight to it because its statistics are not reliable. On the other
hand, we can assign larger weights to relevant allophones with
enough examples in the training data. In our experiments, we use a
simple function based on the frequencies of the relevant
allophones. All irrelevant allophones have the weight 1.0, and the
weight for a relevant allophone is given by the following function:

$$1 + \log_a(x)$$

where $x$ is the frequency of the relevant allophone, and $a$ is
chosen to be the minimum number of training examples needed to
train a reasonable model in our configuration.
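
As a small worked illustration of this weight function, the sketch below evaluates it for a few frequencies. The function name and the value a = 50 are assumptions chosen only for the example; the text does not state the value of a used in the experiments.

```python
import math

def prominent_weight(freq, a, relevant):
    """Weight applied to an allophone's training examples.

    freq:     number of training examples of the allophone (x in the text)
    a:        minimum number of examples for a reasonable model (log base)
    relevant: whether the allophone occurs in the target vocabulary
    """
    if not relevant:
        return 1.0
    return 1.0 + math.log(freq, a)

A = 50  # hypothetical value of a
w_rare = prominent_weight(1, A, relevant=True)         # a single example
w_enough = prominent_weight(50, A, relevant=True)      # exactly a examples
w_frequent = prominent_weight(2500, A, relevant=True)  # a squared examples
w_irrelevant = prominent_weight(2500, A, relevant=False)
```

A relevant allophone seen exactly a times receives weight 2, one seen a squared times receives weight 3, and one seen only once receives weight 1, the same as an irrelevant allophone. The logarithm thus lets the bias grow slowly with the amount of supporting data.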

Imposing a bias toward the relevant allophones is similar to
duplicating the training data of the relevant allophones. For
example, using a prominent weight of 2.0 for a training example in
the Baum-Welch re-estimation is like observing the same training
example twice. Our vocabulary-bias training algorithm is therefore
equivalent to duplicating the training examples of relevant
allophones according to the weight function. By the same principle,
this adaptation algorithm can be applied to non-HMM systems by
duplicating the training data of relevant allophones so that they
become the majority of the training data. The resulting models will
then be tailored to the relevant allophones.
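
The equivalence between weighting and duplication can be checked on a deliberately simplified re-estimation step. The sketch below re-estimates a discrete output distribution from weighted codeword observations; it leaves out the forward-backward posteriors of real Baum-Welch training, and the function name and data are hypothetical.

```python
from collections import Counter

def reestimate_output_probs(observations):
    """Normalize weighted codeword counts into output probabilities.

    observations: list of (codeword, weight) pairs; the weight plays the
    role of the prominent weight applied to a training example.
    """
    counts = Counter()
    for codeword, weight in observations:
        counts[codeword] += weight
    total = sum(counts.values())
    return {codeword: c / total for codeword, c in counts.items()}

# One example of codeword 0 from a relevant allophone (weight 2.0) and one
# example of codeword 1 from an irrelevant allophone (weight 1.0) ...
weighted = reestimate_output_probs([(0, 2.0), (1, 1.0)])

# ... give the same estimate as seeing the relevant example twice.
duplicated = reestimate_output_probs([(0, 1.0), (0, 1.0), (1, 1.0)])
```

Both calls accumulate a count of 2 for codeword 0 and 1 for codeword 1, so the resulting distributions are the same.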

4  Environment  Adaptation 

It is well known that when a system is trained and tested under
different environments, recognition performance drops moderately
[8]. In VI systems, however, training and testing are very likely
to take place under different environments, because the VI models
can be used for any task, which could happen anywhere. Even if the
recording hardware (microphones, A/D converters, pre-amplifiers,
etc.) remains unchanged, other environmental factors (room size,
background noise, microphone positions, reverberation from surface
reflections, etc.) are beyond our control. For example, when
comparing the recording environments of Texas Instruments (TI) and
Carnegie Mellon University (CMU), a few differences were observed
although both used the same close-talking microphone (Sennheiser
HMD-414).

•  Recording  equipment  -  TI  and  CMU  used  different  A/D 
devices,  filters  and  pre-amplifiers  which  might  change 
the  overall  transfer  function  and  thus  generate  different 
spectral  tilts  on  speech  signals. 

•  Room  -  The  TI  recording  took  place  in  a  sound-proof 
room,  while  the  CMU  recording  took  place  in  a  big  labo¬ 
ratory  with  much  background  noise  (mostly  paper  rustle, 
keyboard  noise,  and  other  conversations).  Therefore, 
CMU’s  data  tends  to  contain  more  additive  noise  than 
TI’s. 

•  Input level - The CMU recording process always adjusted the
amplifier's gain control for each speaker to compensate for
differences in speaking volume. Since the sound volume of TI's
female speakers tends to be much lower, TI probably did not adjust
the gain control as CMU did. Therefore, the dynamic range of CMU's
data tends to be larger.

4.1  Codebook  Adaptation 

The speech signal processing of our VI system is based on a
characterization of speech by a codebook of prototypical models
[7]. Typically, the performance of codebook-based systems degrades
over time as the speech signal drifts with environmental changes,
because the distortion between the speech and the codebook
increases.

Two possible adaptation strategies are therefore:

1. continuously updating the codebook prototypes to fit the testing
speech spectral vectors $x_t$;

2. continuously transforming the testing speech spectral vectors
$x_t$ into normalized vectors $y_t$, so that the distribution of
the $y_t$ is close to that of the training data described by the
codebook prototypes.

Our first environment adaptation algorithm follows the first
strategy, while the two cepstral normalization algorithms described
in Section 4.2 follow the second.

Semi-continuous HMMs (SCHMMs), or tied-mixture continuous HMMs
[9, 3], have been proposed to extend discrete HMMs by replacing the
discrete output distributions with a combination of the original
discrete output probability distributions and continuous pdf's of
codebooks. SCHMMs can jointly re-estimate both the codebooks and
the HMM parameters to achieve an optimal codebook/model combination
according to a maximum likelihood criterion during training. They
have been applied to several recognition systems with improved
performance over discrete HMMs [9, 3].

The codebooks of our vocabulary-independent system can be modified
to optimize the probability that the vocabulary-independent HMMs
generate data from the new environment, according to the SCHMM
framework. Let $\mu_i$ denote the mean vector of codebook index $i$
in the original codebook; the new vector $\bar{\mu}_i$ can then be
obtained from the following equation

$$\bar{\mu}_i = \frac{\sum_m \left( \sum_t \gamma_i^m(t)\, x_t \right)}{\sum_m \left( \sum_t \gamma_i^m(t) \right)} \qquad (1)$$

where $\gamma_i^m(t)$ denotes the posterior probability of
observing codeword $i$ at time $t$ using HMM $m$ for speech vector
$x_t$.

Note that we did not use continuous Gaussian pdf's to represent the
codebooks in Equation 1. Each mean vector of the new codebook is
computed from the acoustic vectors $x_t$ weighted by the
corresponding posterior probabilities from the discrete
forward-backward algorithm, without any continuous pdf computation.
The new data from a different environment, $x_t$, are automatically
aligned with the corresponding codewords in the forward-backward
training procedure. If a vector is not closely associated with a
codeword in the HMM training procedure, its contribution to the
re-estimation of that codeword is de-weighted accordingly by the
posterior probability $\gamma_i^m(t)$, so that the new codebook is
adjusted to fit the new data.
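
The update in Equation 1 amounts to a posterior-weighted average of the adaptation frames. The sketch below assumes the posteriors have already been computed by the forward-backward pass and pooled over all HMMs, so the sum over m is folded into the frame list. Keeping the old mean for a codeword that receives no posterior mass is an assumption of this example and is not stated in the text.

```python
def update_codebook(old_means, frames, posteriors):
    """Posterior-weighted re-estimation of codebook mean vectors.

    old_means:  old_means[i] is the original mean vector of codeword i
    frames:     frames[t] is the speech vector x_t
    posteriors: posteriors[t][i] is the posterior of codeword i at frame t
    """
    dim = len(old_means[0])
    new_means = []
    for i, old in enumerate(old_means):
        denom = sum(post[i] for post in posteriors)
        if denom == 0.0:
            new_means.append(list(old))  # no evidence: keep the old mean
            continue
        numer = [
            sum(post[i] * x[d] for post, x in zip(posteriors, frames))
            for d in range(dim)
        ]
        new_means.append([value / denom for value in numer])
    return new_means

# Toy data: three 2-dimensional frames and a 3-codeword codebook.
old_means = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]
frames = [[1.0, 1.0], [3.0, 3.0], [9.0, 11.0]]
posteriors = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
new_means = update_codebook(old_means, frames, posteriors)
```

Here codeword 0 moves to the average of the first two frames, codeword 1 moves to the third frame, and codeword 2 is left unchanged because no frame is aligned with it.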

4.2  Cepstral  Normalization 

The environmental factors that differ between TI's and CMU's
recording environments can roughly be classified into two
complementary categories:

1. additive noise - noise from different sources, like paper
rustle, keyboard noise, other conversations, etc.

2. spectral equalization - distortions from the convolution of the
speech signal with an unknown channel, like positions of
microphones, reverberation from surface reflections, etc.

Acero et al. [1, 2] proposed a series of environment normalization
algorithms based on joint compensation for additive noise and
equalization. They have been implemented successfully in SPHINX to
achieve robustness to different microphones. Among those
algorithms, codeword-dependent cepstral normalization (CDCN) is the
most accurate, while interpolated SNR-dependent cepstral
normalization (ISDCN) is the most efficient*. In this study, we
incorporate these two algorithms to make our vocabulary-independent
system more robust to environmental variations.

$$x = z - w(q, n) \qquad (2)$$

Equation 2 is the environmental compensation model, where x, z, w,
q and n represent, respectively, the normalized vector, the
observed vector, the correction vector, the spectral equalization
vector and the noise vector. The CDCN algorithm attempts to
determine the q and n that make the ensemble of compensated vectors
x collectively closest to the set of locations of legitimate VQ
codewords. The correction vector w is then obtained with an MMSE
estimator based on q, n and the codebook. In ISDCN, q and n are
determined by an EM algorithm that aims to minimize the VQ
distortion. The final correction vector w also depends, through a
sigmoid function, on the instantaneous SNR of the current input
frame.
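
To make the compensation model concrete, the sketch below applies Equation 2 with an SNR-dependent correction vector. The exact form of w is given in [1] and is not reproduced here. The sigmoid interpolation between n and q, and the parameters `alpha` and `beta`, are purely illustrative assumptions and should not be read as the ISDCN definition.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def correction_vector(q, n, snr_db, alpha=1.0, beta=10.0):
    """Hypothetical SNR-dependent correction vector w.

    Blends toward n at low SNR and toward q at high SNR. alpha (slope) and
    beta (midpoint, in dB) are made-up parameters for this illustration.
    """
    f = sigmoid(alpha * (snr_db - beta))
    return [ni + (qi - ni) * f for qi, ni in zip(q, n)]

def normalize(z, w):
    """Equation 2: normalized vector x = z - w."""
    return [zi - wi for zi, wi in zip(z, w)]

q = [1.0, 0.5]   # toy spectral equalization vector
n = [4.0, 2.0]   # toy noise vector
z = [5.0, 3.0]   # toy observed cepstral vector

w_high = correction_vector(q, n, snr_db=60.0)   # clean frame
w_low = correction_vector(q, n, snr_db=-40.0)   # noise-dominated frame
x_high = normalize(z, w_high)
```

The only properties this sketch is meant to show are that the correction is subtracted from the observed vector and that it varies smoothly with the frame's instantaneous SNR.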


*The reader is referred to [1] for the detailed CDCN and ISDCN
algorithms.


5  Experiments and Results

All experiments are evaluated on the speaker-independent DARPA
Resource Management task. This is a 991-word continuous speech
task, and a standard word-pair grammar with perplexity 60 was used
throughout. The test set, TI-TEST, consists of 320 sentences from
32 speakers (a random selection from the June 1988, February 1989
and October 1990 DARPA evaluation sets).

In order to isolate the influence of cross-environment recognition,
an identical test set, CMU-TEST, from 32 speakers (different from
the TI speakers) was collected at CMU. Our baseline uses 4-codebook
discrete SPHINX with decision-tree based generalized allophones as
the VI subword units [7]. Table 1 shows that about 9% error
reduction is achieved by adapting the decision trees to the
Resource Management task, while about 15% error reduction is
achieved by using vocabulary-bias training for the same task.
Nevertheless, when we combine the two adaptation algorithms to
further tailor the vocabulary-independent models to the Resource
Management task, no compound improvement is produced. Either both
algorithms learn similar characteristics of the target task, or the
combination already reaches the limit of the adaptation capability
of our modeling technique without the help of vocabulary-specific
data.

Condition                    Error Rate    Error Reduction
Baseline                     5.4%          N/A
+VA decision trees           4.9%          9.3%
+VB training                 4.6%          14.8%
+VA trees & VB training      4.6%          14.8%

Table 1: The results for Resource Management using the
vocabulary-adapted decision tree and vocabulary-bias training
algorithms

In the codebook adaptation experiments, the 4 codebooks used in our
HMM-based system are updated according to Equation 1. We randomly
select 100, 300, 1000 and 2000 sentences from the TIRM database to
form different adaptation sets. Two iterations were carried out for
each adaptation set to estimate the new codebooks for TI's data,
while the HMM parameters were held fixed. Table 2 shows the
recognition results after adaptation on the TI test set. Adapting
the codebooks to the new environment yields only a marginal
improvement, even with a large amount of adaptation data. This
result suggests that adapting the codebooks alone fails to produce
adequate adaptation because the HMM statistics used by the
recognizer have not been updated.

Adaptation Sentences    CMU-TEST    TI-TEST
Baseline                5.4%        7.4%
100                     N/A         7.1%
300                     N/A         7.0%
1000                    N/A         7.0%
2000                    N/A         6.9%

Table 2: The vocabulary-independent results on TI-TEST obtained by
adapting the codebooks to TI's data

Table 3 shows the recognition error rates on the two test sets for
VI systems incorporating CDCN and ISDCN. Note that our VI training
set was recorded at CMU. The degradation of cross-environment
recognition on TI-TEST is reduced by roughly 50%. As with most
environment normalization algorithms, robustness to other
environments comes with a minor performance degradation in
same-environment recognition.

Test Set    CMU-TEST    TI-TEST
Baseline    5.4%        7.4%
CDCN        5.6%        6.4%
ISDCN       5.7%

Table 3: The results for environment normalization using CDCN and
ISDCN
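
The percentages quoted for Tables 1 and 3 can be reproduced with a short calculation. The sketch assumes that "degradation" means the gap between the TI-TEST error rate and the baseline CMU-TEST error rate. That reading is an interpretation, since the text does not define the term precisely.

```python
def relative_reduction(before, after):
    """Relative reduction, in percent, going from `before` to `after`."""
    return 100.0 * (before - after) / before

# Error rates (percent) taken from Tables 1 and 3.
baseline_cmu, baseline_ti = 5.4, 7.4
va_trees, vb_training = 4.9, 4.6
cdcn_ti = 6.4

# Table 1: error reduction relative to the 5.4% baseline.
va_reduction = relative_reduction(baseline_cmu, va_trees)
vb_reduction = relative_reduction(baseline_cmu, vb_training)

# Table 3: cross-environment degradation before and after CDCN,
# both measured against the baseline CMU-TEST error rate (assumption).
baseline_degradation = baseline_ti - baseline_cmu
cdcn_degradation = cdcn_ti - baseline_cmu
degradation_reduction = relative_reduction(baseline_degradation, cdcn_degradation)
```

Rounded to one decimal place, the two error reductions come out at 9.3% and 14.8%, matching Table 1. The degradation falls from 2.0 to 1.0 percentage points with CDCN, which is the roughly 50% reduction quoted in the text.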


6  Conclusions 


In this paper, we have presented two vocabulary adaptation
algorithms, vocabulary-adapted decision trees and vocabulary-bias
training, that improve the performance of the
vocabulary-independent system on the target task by tailoring the
VI subword models to the target vocabulary. At the 1991 DARPA
Speech and Natural Language Workshop [7], we showed that our VI
system was already slightly better than our VD system. With these
two adaptation algorithms, which led to 9% and 15% error reduction
respectively on the Resource Management task, the resulting VI
system is far more accurate than our VD system. In [8], we
demonstrated improved vocabulary-independent results with
vocabulary-specific adaptation data. In the future, we plan to
extend our adaptation algorithms with the help of
vocabulary-specific data to achieve further adaptation to the
target vocabulary (task).

CDCN and ISDCN have been successfully incorporated into the
vocabulary-independent system and reduce the degradation of VI
cross-environment recognition by 50%. In the future, we will keep
investigating new environment normalization techniques to further
reduce the degradation and ultimately achieve full robustness
across different acoustic environments. Moreover, environment
adaptation with environment-specific data will also be explored for
adapting the VI system to a new environment once we have more
knowledge about it.

Making speech recognition systems more robust to new vocabularies
and new environments is essential to making speech recognition
applications feasible. Our results have shown that plentiful
training data, careful subword modeling (decision-tree based
generalized allophones) and suitable environment normalization
compensate for the lack of vocabulary-specific and
environment-specific training. With the additional help of
vocabulary adaptation, the vocabulary-independent system can be
further tailored to any task quickly and cheaply, which greatly
facilitates speech applications.